Multifractal analysis of nonhyperbolic coupled map lattices: Application to genomic 

sequences 
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Symbolic sequences generated by coupled map lattices (CMLs) can be used to model the chaotic- 
like structure of genomic sequences. In this study it is shown that diffusively coupled Chebyshev 
maps of order 4 (corresponding to a shift of 4 symbols) very closely reproduce the multifractal 
spectrum Dq of human genomic sequences for coupling constant a = 0.35 ± 0.01 if g > 0. The 
presence of rare configurations causes deviations for g < 0, which disappear if the rare event statistics 
of the CML is modified. Such rare configurations are known to play specific functional roles in 
genomic sequences serving as promoters or regulatory elements. 

PACS numbers: 89.75.Fb (Structure and organization in complex systems); 05.45.Df (Fractals); 05.45.Ra 
(Coupled Map Lattices); 87.14.gk (DNA). 



I. INTRODUCTION 

Coupled Map Lattices (CMLs) are frequently used as 
models for complex, often chaotic spatial and dynami- 
cal structures observed in diverse physical systems ■ 
Particular CMLs have been used to model hydrodynamic 
systems, chemical kinetics, biological systems, and field 
theoretical models These types of models arise nor- 

mally in situations where the nonlinear nature of the phe- 
nomena is complimented by a nontrivial underlying spa- 
tial geometry. Of particular interest are non-hyperbolic 
coupled map lattices, where the local map is allowed to 
have one or several points with zero slope. 

In the current study we investigate the behavior of 
coupled 4-th order Chebyshev maps, T4, and compare 
their multifractal spectra with that of DNA sequences. 
In fact, it is well-known that the dynamics of T/^ is equiv- 
alent to a Bernoulli shift of 4 symbols. Thus the coupled 
map dynamics with T4 as a local map corresponds to a 
coupled shift of information that is encoded by 4 sym- 
bols. In this respect it is natural to study the potential 
correspondence of statistics generated by coupled T4 as 
compared to that of genomic sequences composed of the 
four symbol- nucleotides, namely Adenine (A), Cytosine 
(C), Guanine (G) and Thymine (T). Earlier studies on 
the primary structure of DNA have shown that the statis- 
tics of genomic sequences exhibits nontrivial correlations 
and cannot be reproduced by a pure random stochas- 
tic process involving 4 symbols |7h25| . A natural way to 
gradually introduce correlations in the phase space struc- 
ture of T4 is via coupling of many r4 maps on a lattice. 
In our approach nontrivial correlations are introduced by 
means of a coupling constant a which diffusively couples 
nearest neighbor maps on the lattice and takes values 
< a < 1. 

Chebyshev maps are known to exhibit the strongest 
possible chaotic behavior characterized by a minimum 
skeleton of higher-order correlations (2S . i27j | . For weak 
coupling a analytic results have been previously derived 
on the perturbed invariant 1-point density |28l| and on 



the existence of periodic orbits [29j . These investigations 
provide motivation for a discussion of the possibility of 
CMLs to reproduce similar statistics as observed in ge- 
nomic sequences for finite values of the coupling constant. 
In this study, we investigate the multifractal spectrum 
resulting by appropriately sectioning the phase space of 
the CML to assimilate 4-symbol sequences. The choice 
of the multifractal spectrum as the relevant observable 
is particularly suitable for comparing genomic and CML 
sequences because it reveals the characteristic details of 
moments and symbol correlations of all orders. 

In the next section we first recall the multifractal spec- 
trum of a single Chebyshev map and we further explore 
the spectra of coupled Chebyshev maps on a 1-D lattice 
with periodic boundary conditions. In section IIIII a 1- 
1 correspondence is introduced between 4 appropriately 
chosen sections of the local CML phase space and the 4 
symbols of an artificial genomic sequence. The multifrac- 
tal spectra of entire human chromosomes are compared 
with the CML spectra for various values of a. Coupling 
values of the order of a ~ 0.34 — 0.36 are shown to yield 
multifractal spectra closely approximating the correla- 
tions in DNA sequences for positive q. In section IIVI it 
is shown that rare configurations need to be introduced 
in the CML dynamics to closely approximate the DNA 
spectra for both positive and negative q. In the conclud- 
ing section the final results are summarized and open 
problems are discussed. 



II. MULTIFRACTAL SPECTRA OF CMLS 

A. Multifractal Spectrum of 4-th order Chebyshev 

map 

The dynamics of the T4 map is generated by the re- 
currence relation 



Xn+l = T4,{Xn) = S,X^ ~ Sl + 1, 

with n = 0,1,2..., -1 < a;o < 1. 



(1) 
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Xn S [~ 1 ■ 1] is a continuous variable and takes values 
in the interval [—1,1] and n is a discrete time variable. 
This map is known to show strongest possible chaotic 
behavior. The multifractal spectrum generated by its 
invariant density is known analytically to take the form 
MM: 



1, for q<2 

^i, for q>2 



(2) 



Generally the Renyi (multifractal) dimensions are defined 
as 



Dq = lim logV'pf 

^ e^O Q - Hog e 



(3) 



where pi = ^^_^j^^^^^p{x)dx are the probabilities associ- 
ated with a partitioning of the phase space into boxes of 
equal size e. The multifractal spectrum given by Eq. ([2]) 
is easily obtained from the invariant probability density 

P{x) - J-^ . -1 < X < 1, (4) 

TTV 1 — X^ 



see, e.g. j3,C|. The presence of two singularities of 
the probability density at a; = ±1 produces a phase 
transition-like point oi Dq aX q = 2, see Eq. ^ and 
the corresponding Dq vs. q diagram in Fig. [T] This 
multifractal spectrum is formally obtained in the limit of 
infinitesimal (e — ?> 0) segmentation of the interval [—1,1] 
where is defined. This idealized limit is hardly observ- 
able in finite size systems, as is demonstrated in Fig. [T] 
In particular, genomic sequences are finite in size, having 
a definite number of nucleotides, hence it is not possible 
to achieve infinitesimally small segmentations. For com- 
parison with real data it is useful to explore finite size 
effects, for small but nonzero values of e. These are de- 
picted in Fig. [TJ For statistical reasons, large numbers of 
L = 10^ or 10^ uncoupled Chebyshev maps were consid- 
ered, iterated over 5000 time steps with random initial 
values. The analytical result (black dotted curve) is ob- 
tained in the limit e — > and L — > c». 

For finite e > the abrupt critical point behavior is 
deformed into a smooth but rapidly changing curve in 
the region of < <; < 2. The second derivative of Dq as a 
function of q is sometimes observed to switch sign. Such 
'humps' have also been observed for the Renyi dimensions 
and Renyi entropies associated with symbol sequences of 
the human genome, see Refs. [25l. |32|. [33| for more details. 

Generally, the shape of numerically determined Dq 
spectra of non-hyperbolic maps is heavily influenced by 
finite size effects, and we expect similar finite size ef- 
fects to be present for multifractal spectra of genomic 
sequences. 



B. Multifractal Spectra of Linearly Coupled 
Chebyshev Maps 

Our interest in coupling Chebyshev maps results from 
the tendency of local interactions in systems with mul- 
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Figure 1: (Color online) The multifractal spectrum of uncou- 
pled {a = 0) T4S as obtained numerically for finite e and A*'. 
The solid vertical line is the line q — 2. The dotted curve rep- 
resents the analytical expression of the multifractal spectrum, 
Eq. @. 



tiple components. In the current study we consider the 
simplest possible diffusive nearest neighbor coupling on a 
1-D lattice with periodic boundary conditions. Namely, 
we assume a linear chain of units ('particles') each of 
which is labelled by the index i. Each unit evolves ac- 
cording to Eq. ([T]) with additional, equal contributions 
from the left and right nearest neighbor particles. In the 
linear chain of size L, a coupling a is introduced so that 
the variable of the i — th unit follows the recurrence 
relation 



x^+i = (1 - a)T4«) + ^ (r4«+i) H- r4«_i)) (5) 



As initial conditions, a random distribution of Xq G 
[—1,1] is assumed. When Chebyshev maps are cou- 
pled on lattices the invariant 1-point densities are gradu- 
ally deformed and singularities tend to smooth out [28| . 
In particular, for a = 1 there are no singularities and 
Dq = l,Vg, while the case a = corresponds to a col- 
lection of independent r4's and the spectrum is given by 
the relation ([2]). 

In Fig. [2] a series of multifractal spectra for different 
values of a are shown, for finite but small values of e. 
L = 10^ coupled T4 maps are taken into account with 
e = 10^^ and e = 10"'^ in Figs. [5^ and [2b, respectively. 
In both cases, as a — > 1, the multifractal spectrum is 
seen to become uniform. In Fig. humps are observed 
due to finite size effects (e as large as 10^'^), for differ- 
ent values of a. Note, in Fig. [5)3, that the humps are 
more evident for higher values of the coupling constant 
a. These a- values are consistent with those that repro- 
duce similar statistics as DNA sequences, as will be seen 
in the subsequent sections. 
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Figure 2: (Color online) a) Numerically obtained multifractal 
spectra of coupled Ti's for various values of a, e = 10~^. b) 
Humps in the multifractal spectra for coupled T4 are observed 
for relatively large values of e (here e = 10~^ is shown). L = 
10^ in both a and b plots. 



III. SYMBOLIC SEQUENCES RESULTING 
FROM CMLS 



Having analyzed multifractal spectra that are directly 
associated with the local distribution of the state vari- 
ables xjj of the CML we now want to go a step further 
and produce symbol sequences from the CML. Compar- 
ing with the multifractal spectra of genomic sequences 
[32h35| | we note that there are certain similarities in the 
two spectra which suggests to explore the possibility of 
a certain chaotic CML processes to reproduce the most 
important genomic spectral features. 

As a first step in this direction one has to reconstruct 
a symbolic sequence of 4 letters, based on the distribu- 
tion of the local CML variable x. When constructing 
the artificial symbolic sequence one needs, at least, to re- 
spect the symbol concentrations of the original genomic 
sequence. If the genomic populations (mean concentra- 
tions) of the 4 nucleotides are denoted as pa,Pc,PG,Pt — 
1 — PA — PC — Pg for Adenine, Cytosine, Guanine and 
Thymine, respectively, then the artificial T4-based ge- 
nomic sequence should contain the same frequency of 
the symbols. To achieve consistency between the genome 



basepair population and the symbolic sequence one needs 
to consider again the 1-point distribution p(a;) of the cou- 
pled T4 map. The interval [—1,1] is segmented into four 
subintervals [— l,xi] ,[xi,X2],[x2,X3] and [0:3, 1], to ac- 
commodate the 4 basepairs. The values of xi,X2 and X3 
were chosen to fulfill the basepair frequency constraints: 



J^]^ dx p{x) 

L2 P{x) 
jl dx p{x) - 



--PA 
-- PC 
PC 

Pt 



(6) 



Having fixed the segmentation values [xi, a;2, X3], the i-ih. 
symbol S^^ of the artificial genomic sequence is chosen as 



yl if -1 < x]^ < xi 

C if xi < x\^ < X2 

G if X2 < xl^ < X3 

T if X3<xi,<l 



(7) 



Hence, a symbolic sequence S'„(z),i = 1, ...L is produced 
which on the one hand carries the complexity of the 
CML and on the other hand respects the average con- 
centrations of the DNA sequence under consideration, 

{pa,Pc,Pg,Pt)- 

In the search for a proper value of the coupling con- 
stant a which best describes the complexity of the 
genomic sequences, it is important to create CML- 
generated sequences of length comparable with the ge- 
nomic ones. In the current study sequences of L = 10^ 
were produced to assimilate the chromosomal DNA. To 
avoid transient phenomena and to ensure that the CML 
Chebyshev maps have unfolded all their chaotic state 
space structure, we have chosen in the simulations to 
use the results produced after n = 5000 iteration time 
steps. This is a safe choice because, normally, the CML 
sequences achieve their typical long-term behavior after 
about 20 iteration time steps. Averages over time steps 
n were not performed. Rather, our aim was to analyze 
a given snapshot of symbols generated by the CML sys- 
tem that spatially had comparable length to that of the 
genomic sequence. 

To locate the coupling constant which best reproduces 
the complexity of the genomic sequences, we calculated 
the multifractal spectra of the genomic sequence and 
compared it with the spectra of the CML-symbol se- 
quences for various values of a. The estimation of the 
multifractal exponents Dq are based on the calculation 
of the probabilities p(ji, «2, ■■■iN) of finding blocks of sym- 
bols zi, 12, ■■■In along the sequence of size L, whereas N 
is the linear size of the block [23, ^31 : 



log X! [pih,i2,---iNW 



D„ 



lim 



q ^ 1(8) 



g — 1 N^oo log E 

log ^ p(ii,i2,...«Ar)ln[p(ji,i2,...iAr)] 



D\ — lim ■ 



ll,J2,...4jv 
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p^=0.291921, p^=0.207966, p^=0.207859, p^=0.292219 
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Figure 3: (Color online) The multifractal spectra for human 
Chromosome 10, with iV = 8 (black hne) and CML-r4's for 
various values of the coupling constant a. 

In Eq. ([9]) E represents the size (total number of con- 
figurations) of the statespace. In this representation the 
exponents Dq represent the increase of the phase space 
when the size of the sequence (or window) increases. As 
an example we consider the homogeneous case, where 

P(il,«2,---*Af) = ] 

E^S^ \ ^ Dq = l 

s — the number of symbols = 4 J 

as is expected for homogeneous sequences. 

In Fig. |3]the multifractal spectrum of the human Chro- 
mosome 10 is plotted and compared with CML-T4 for 
various values of a. The calculations of the spectrum of 
chromosome 10 are based on evaluating all possible sym- 
bol sequences up to length N = 8. This is considered 
as asymptotic behavior since the numerical result does 
not change for values TV > 6, as was previously shown 
in references [H, [s^ . For the calculation of the various 
spectra based on the T4 map the 1-point probabilities 
of the basepairs in chromosome 10 have been respected, 
via Eqs. (|6]) and (O, by choosing appropriate borders 
xi,X2,X3 of the intervals. The observed 1-point proba- 
bilities for chromosome 10, used in this study, are 33] 



Figure 4: (Color online) The multifractal spectra for two ar- 
tificial DNA sequences, produced via the CML — T4. The 
coupling constant is a = 0.35. The dashed (red) line repre- 
sents equal composition of the four basepairs pa = Pc ~ Pa ~ 
Pt = 0.25, while the solid (black) line represents composition 
given by Eq. (9). 



In addition, the deviation from unity for the negative 
q spectrum is accentuated by the unequal frequencies 
of the four basepairs. In Fig. 2] the multifractal spec- 
trum arising from the CML assimilating a random ar- 
tificial DNA sequence with equal basepair distribution 
PA = Pc — Pg = Pt = 0.25 is compared to the case of 
non-equal basepair composition, Eq. (9). In the case of 
equal frequencies the CML process seems to create some 
rare configurations with small probability which tend to 
increase the Dg-values in the negative q region. 

Returning to Fig. [3] one observes that the closest ap- 
proximation to the Chromosome 10 spectrum is achieved 
for a ~ 0.35. This approximation is good only for posi- 
tive q— values, which correspond to positive moments of 
the probability density. This means that configurations 
which are found often in the genome are well represented 
by the CML process a = 0.35, while rare configurations, 
which dominate the negative-q spectra, are not accounted 
sufficiently in this approach. 

In the next section, we will improve the model by ad 
hoc introducing a small number of rare configurations. 



PA = 0.291921, PC = 0.207966, (9) 
PG = 0.207859, PT = 0.292219. 

A first look at Fig. [3l in the negative q region, indicates 
that the coarse graining of the state space into 4 segments 
and the reduction of the continuous CML dynamics to a 
4-symbol shift modifies the T4 spectrum, producing Dq > 
1 for g < 0. This is not surprising, since the multifractal 
spectrum presented in Eq. ^ relates to the statistics of 
the map and represents the frequency of iterates within 
the interval [-1,1], while Eq. ([9]) relates to the increase 
in the number of configurations in the symbolic sequence 
resulting from this map. 



IV. TAKING INTO ACCOUNT RARE 
OLIGONUCLEOTIDES 

To motivate the need for including rare oligonucleotide 
configurations, we first discuss some DNA functional is- 
sues related to the presence of specific rare segments. 
Long-time studies of the primary structure of higher eu- 
caryotes and other organisms have revealed certain par- 
ticularities related mainly to the functionality of DNA se- 
quences. In particular, in coding sequences all combina- 
tions of the four letters are found with almost equal prob- 
ability, without priority given to specific combinations. 
This is not true for the noncoding parts which comprise 
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95-97% of the human genome. In noncoding sequence 
repetitive elements are very common with only the Alu- 
(repetitive sequence) covering approximately 11% of the 
human genome ^3^. Other common elements which are 
often met in eucaryotcs are the poly-A and poly-T chains. 
Likewise, sequences with specific functionality are very 
rare and they are only present for specific purposes in 
the noncoding region. Well known such examples are the 
TATAA box and the GC and CG complexes and multi- 
ple superpositions of them (l6l - l2^ . The presence of these 
complexes is associated with the presence of promoters, 
regulatory elements which designate the subsequent ap- 
pearance of coding segments. These regulatory elements 
have the very specific task of "chemically attracting" the 
enzymes which will act on the closely following coding se- 
quence in order to start the production of RNA which will 
finally lead to the production of the corresponding pro- 
tein. Thus the presence of promoters (and their sequence 
structure) is very specific in the noncoding sequences and 
they are not abundantly found in the genome. Promot- 
ers are not the only sequences which are conserved for 
specific purposes. Other regulatory elements, such as 
the cis-acting and trans-acting elements, also have rare 
sequence structure. 

From the above discussion it becomes clear that the 
structure of the noncoding segments, which dominate the 
genome of higher eucaryotes, is far from being uniform. 
The presence of rare configurations, which is mostly vis- 
ible in the negative q spectrum, needs to be taken into 
account for a proper modelling of the sequence dynamics. 
Rare sequences with specificity, which are not accounted 
for by the simple CML model presented in the previous 
section, will be thus considered ad-hoc in this section. 
This addition will mostly contribute to the negative q— 
values of the multifractal spectrum, which is observed to 
be lower for the CML than for the human chromosomes. 

We modify the dynamics by assuming that some of the 
rare symbol sequences generated by the CML become 
even less frequent by an external coupling mechanism 
(such as escape from the chaotic attractor). It is suffi- 
cient to artificially modify the probability of occurrence 
of a small fraction 6 of configurations, eg. 9 ^ 1/1000, 
creating thus Q — 9 x E rarified configurations. The 
probability of occurrence of these rare configurations can 
be reduced to as much as 10~^ x (lowest — probability) for 
a much better approximation of the chromosomal mul- 
tifractal spectra. These values are indicative and they 
depend weakly on the specific chromosome. In Fig. [5] 
the multifractal spectrum of human chromosome 10 is 
plotted together with CML-r4 with coupling constant 
a = 0.35 and with modifications to include 8 = 60 rare 
configurations (blue line). For comparison the case of 
CML-r4 with coupling constant a = 0.35 without rare 
configurations is also included (green dotted line). Be- 
cause the modified configurations are few and rare, their 
contributions to the positive q region of the multifrac- 
tal spectrum is negligible. On the contrary, for neg- 
ative q they give an important contribution increasing 
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Figure 5: (Color online) The multifractal spectrum for human 
chromosome 10, with A'' = 8 (black solid line), and CML- 
T4 with coupling constant a = 0.35 modified with few rare 
configurations as described in the text (blue dashed line). The 
green dotted line depicts the CML-T4 representation without 
rare configurations. 

significantly the Dq values and good agreement is then 
achieved for both positive and negative q. Similar results 
are also obtained for the other human chromosomes, with 
adjustable values of a and 9. 

In Fig. ini we plot the variable a which denotes the 
mean square deviation between the multifractal curve of 
chromosome 10 and CML-T4's modified with rare config- 
urations for various values of a. 

1 ^ 

-'(")-]^ E (10) 

In Eq. ((To)) Dq^^^{a) denotes the multifractal exponent 
of order q for the CML of with coupling constant a, 
while D]^ denotes the corresponding multifractal expo- 
nent of order q for chromosome 10. The sum runs from 
—N to N over positive and negative g— values at equal 
distance A; A^i = (2iV -f 1)/A is the total number of q— 
values considered. Figure [S] shows that the smallest a 
value for chromosome 10 is found for a — 0.35 ±0.01 and 
thus the coupled Chebyshev string with coupling con- 
stant a = 0.35 ± 0.01 best represents the correlations 
found in chromosome 10. Similar values are also obtained 
for the other human chromosomes. 



V. CONCLUSION 

The multifractal spectrum of CMLs has been deter- 
mined using as a working example the 4-th order Cheby- 
shev map Ti diffusively coupled on a 1-D lattice with 
periodic boundary conditions. This choice of local map 
is particularly suited for our biological application since 
it corresponds to a shift of information encoded in 4 sym- 
bols, just as the DNA string encodes information using 
4 nucleotides. 
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Figure 6; The deviation a (black circles) as a function of 
a between the multifractal spectra of human chromosome 10 
and CML-T4's modified with rare configurations. The dotted 
line is a cubic fit to the data. 



In the current study the CML was used to assimilate 
the correlated structure of genomic sequences, based on 
a comparison of the corresponding multifractal spectra. 
It was shown that the CML of T4 can reproduce quite 
closely the multifractal spectrum of genomic sequences 
for positive g— values, while it deviates significantly for 
negative q. In particular, for human chromosome 10 
(complete sequence) the best approximation for the pos- 
itive q spectrum is obtained for coupling constant value 
a = 0.35. 

In an attempt to model both positive and negative 
q— spectra of chromosomes as closely as possible, we 
consider the differences of the frequency representation 
of various functional units. One particular property of 
noncoding DNA is the presence of rare configurations 
(oligonucleotide sequences) which have specific function- 



ality serving as promoters for the production of proteins 
or as regulatory elements. Such specific sequences are the 
TATAA-box, various GC-complexes and other elements 
which vary for different classes of organisms. We model 
these rare configurations by introducing an additional 
(artificial) escape process for the CML, which modifies 
the probabilities of certain rare sequences. If rare config- 
urations representing these particular sequences are con- 
sidered via an ad-hoc modification of the simulated dis- 
tribution in the symbolic sequences resulting from the 
CML of T4, then we see that both negative and posi- 
tive q multifractal spectra of genomic sequences are well 
approximated by the CMLs. 

It is interesting to mention here that a good represen- 
tation of the distribution of base pair sequences in DNA 
must be a superposition of (at least) two components. 
One of these components represents mostly the coding 
sequences and the second one contributes to the non- 
coding ones. This is in line with earlier studies of DNA 
sequences which have shown that the coding and non- 
coding parts follow different statistics, related to their 
different functionality [H [H [13 . 

In the current study, as a representative example, the 
human chromosome 10 was investigated in detail and the 
optimum value of the corresponding coupling constant of 
the CML was determined. Likewise, additional studies 
not described here have shown similar qualitative and 
quantitative behavior for the other human chromosomes. 
Further studies are required to show if the same approach 
can be applied to different classes of organisms, where 
the ratio of lengths of coding/noncoding sequences takes 
on different values. In a future study it will be very 
interesting to explore the range of values of the coupling 
constant a and the rare configuration frequency 9 that 
may characterize the different classes of organisms. 
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